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In this article we propose a novel MCMC method based on deterministic transformations T : 
XxV — > X where X is the state-space and V is some set which may or may not be a subset of X. 
We refer to our new methodology as Transformation-based Markov chain Monte Carlo (TMCMC). 
One of the remarkable advantages of our proposal is that even if the underlying target distribution is 
very high-dimensional, deterministic transformation of a one-dimensional random variable is suffi- 
cient to generate an appropriate Markov chain that is guaranteed to converge to the high-dimensional 
target distribution. Apart from clearly leading to massive computational savings, this idea of de- 
terministically transforming a single random variable very generally leads to excellent acceptance 
t-h rates, even though all the random variables associated with the high-dimensional target distribution 

are updated in a single block. Since it is well-known that joint updating of many random variables 
using Metropolis-Hastings (MH) algorithm generally leads to poor acceptance rates, TMCMC, in this 
regard, seems to provide a significant advance. We validate our proposal theoretically, establishing 
the convergence properties. Furthermore, we show that TMCMC can be very effectively adopted for 
simulating from doubly intractable distributions. 

We show that TMCMC includes hybrid Monte Carlo (HMC) as a special case. We also contrast 
TMCMC with the generalized Gibbs and Metropolis methods of Liu and Yu (1999), Liu and Sabatti 
(2000) and Kou, Xie and Liu (2005), pointing out that even though the latter also use transforma- 
tions, their goal is to seek improvement of the standard Gibbs and Metropolis Hastings algorithms by 
adding a transformation-based step, while TMCMC is an altogether new and general methodology 
for simulating from intractable, particularly, high-dimensional distributions. 

TMCMC is compared with MH using the well-known Challenger data, demonstrating the effec- 
tiveness of of the former in the case of highly correlated variables. Moreover, we apply our methodol- 
ogy to a challenging posterior simulation problem associated with the geostatistical model of Diggle 
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et al. (1998), updating 160 unknown parameters jointly, using a deterministic transformation of a 
one-dimensional random variable. Remarkable computational savings as well as good convergence 
properties and acceptance rates are the results. 

Keywords: Geostatistics; High dimension; Inverse transfromation; Jacobian; Metropolis-Hastings algo- 
rithm; Mixture proposal 

1 Introduction 

Markov chain Monte Carlo (MCMC) has revolutionized statistical, particularly, Bayesian computation. 
In the Bayesian paradigm, however complicated the posterior distribution may be, it is always possible, 
in principle, to obtain as many (dependent) samples from the posterior as desired, to make inferences 
about posterior characteristics. But in spite of the obvious success story enjoyed by the theoretical side of 
MCMC, satisfactory practical implementation of MCMC often encounters severe challenges, particularly 
in very high-dimensional problems. These challenges may arise in the form of the requirement of enor- 
mous computational effort, often requiring inversions of very high-dimensional matrices, implying the 
requirement of enormous computation time, even for a single iteration. Given that such high-dimensional 
problems typically converge extremely slowly to the target distribution triggered by complicated poste- 
rior dependence structures between the unknown parameters, astronomically large number of iterations 
(of the order of millions) are usually necessary. This, coupled with the computational expense of individ- 
ual iterations, generally makes satisfactory implementation of MCMC, and hence, satisfactory Bayesian 
inference, infeasible. That this is the situation despite steady technological advancement, is somewhat 
disconcerting. 

1.1 Overview of the contributions of this paper 

In an attempt to overcome the problems mentioned above, in this paper we propose a novel method- 
ology that can jointly update all the unknown parameters without compromising the acceptance rate, 
unlike in Metropolis-Hastings (MH) algorithm. In fact, we show that even though a very large num- 
ber of parameters are to be updated, these can be updated by simple deterministic transformations of 
a single, one-dimensional random variable, the distribution of which can be chosen very flexibly. As 
can be already anticipated from this brief description, indeed, this yields an extremely fast simulation 
algorithm, thanks to the singleton random variable to be flexibly simulated, and the subsequent simple 
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deterministic transformation, for example, additive transformation. It is also possible, maybe more ef- 
ficient sometimes, to generate more than one, rather than a single, random variables, from a flexible 
multivariate (generally independent), but low-dimensional distribution. We refer to our new methodol- 
ogy as Transformation-based MCMC (TMCMC). 

We show that by generating as many random variables as the number of parameters, instead of a 
single/few random variables, TMCMC can be reduced to a MH algorithm with a specialized proposal 
distribution. Another popular MCMC methodology, the hybrid Monte Carlo (HMC) method, which 
relies upon a specialized deterministic transformation, will be shown to be a special case of TMCMC. 

We also provide a brief overview of the transformation-based generalized Gibbs and Metropolis 
methods of Liu and Yu (1999), Liu and Sabatti (2000) and Kou, Xie and Liu (2005), and point out 
their differences with TMCMC, also arguing that TMCMC can be far more efficient at least in terms of 
computational gains. 

Apart from illustrating TMCMC on the well-known Challenger data set, and demonstrating its supe- 
riority over existing MH methods, we successfully apply TMCMC with the mere simulation of a single 
random variable, to update 160 unknown parameters in every iteration, in the challenging geospatial 
problem of Diggle et al. (1998). The computational challenges involved with this and similar geospatial 
problems have motivated varieties of MCMC algorithms and deterministic approximations to the poste- 
rior in the literature (see, e. g. Rue et al. (2009), Christensen et al. (2006) and the references therein). 
With our TMCMC algorithm we have been able to perform 5.5 x 10 7 iterations (in a few days) and obtain 
reasonable convergence. 

We also show how TMCMC can be adopted to significantly improve computational efficiency in dou- 
bly intractable problems, where the posterior, apart from being intractable, also involves the normalizing 
constant of the likelihood — the crucial point being that the normalizing constant, which depends upon 
unknown parameters, is also intractable. 

The rest of this article is structured as follows. In Section 2 we introduce our new TMCMC method 
based on transformations. The univariate and the multivariate cases are considered separately in Sections 
2.1 and 2.2 respectively. In Section 3 we study in details the role and efficiency of a singleton e in 
updating high-dimensional Markov chains using TMCMC. In Section 4 we provide a brief overview of 
HMC, and show that it is a special case of TMCMC. In Section 5 we provide a brief overview of the 
generalized Gibbs and Metropolis methods of Liu and Yu (1999), Liu and Sabatti (2000) and Kou, Xie 
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and Liu (2005), discuss their differences with TMCMC, and argue that the latter offers more advantages 
than generalized Gibbs/Metropolis in terms of computational savings. Illustration of TMCMC with 
singleton e using the Challenger data and comparison with a popular MCMC technique are provided in 
Section 6. Application of TMCMC with single e to the 160-dimensional geospatial problem of Diggle 
et al. (1998) is detailed in Section 7. Section 8 shows how TMCMC may be applied to the bridge- 
exchange algorithm of Murray et al. (2006) in doubly intractable problems to speed-up computation. 
Finally, conclusions and overview of future work are provided in Section 9. 

2 MCMC algorithms based on transformations on the state-space 

In this section we propose and study the TMCMC algorithms. First, we construct it for state-spaces 
of dimension one. This case is not of much interest because the state space is similar to the real line 
and numerical integration is quite efficient in this scenario. Nevertheless, construction of the TMCMC 
algorithm for one dimensional problems helps to generalize it to higher dimensions and points out its 
connections (similarities in one-dimension and dissimilarities in higher dimensions) with the MH algo- 
rithm. In Section 2.2 the TMCMC algorithm is generalized to higher dimensional state-spaces. 

2.1 Univariate case 

Before providing the formal theory we first provide an informal discussion of our ideas with a simple 
example involving the additive transformation. 

2.1.1 Informal discussion 

In order to obtain a valid algorithm based on transformations, we need to design appropriate "move 
types" so that detailed balance and irreducibility hold. Given that we are in the current state x, we can 
propose the "forward move" x' = x + e; here e > is a simulation from some arbitrary density of the 
form g(e)/( ,oo)(e)- To move back to x from x', we need to apply the "backward transformation" x' — e. 
In general, given e and the current state x, we shall denote the forward transformation by T(x, e), and 
the backward transformation by T b (x, e). 

The forward and the backward transformations need to be 1-to-l. In other words, for any fixed e, 
given x' the backward transformation must be such that x can be retrieved uniquely. Since this must hold 
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good for every x in the state space, the transfomation must be onto as well. Similarly, for any fixed e, 
there must exist x such that the forward transformation leads to arbitrarily chosen x' in the state space 
uniquely, implying that this transformation is 1-to-l, onto as well. If, given e and x', say, more than one 
solution exist, then return to the current value x can not be ensured, and this makes detailed balance, a 
requirement for stationarity of the underlying Markov chain, hard to satisfy. 

The detailed balance requirement also demands that, given x, the regions covered by the forward 
and the backward transformations are disjoint. On the other hand, the backward move covers the region 
where all the values are less than x. For example, in our additive transformation case, the forward trans- 
formation always takes x to some unique x', where x' > x. To return from x' to x, it is imperative that the 
backward transformation decreases the value of x' to give back x. Thus, if the forward transformation al- 
ways increases the current value x, the backward transformation must always decrease x. In other words, 
the regions covered by the two transformations are disjoint. Since x is led to x' by the forward trans- 
formation and x' is taken back to x by the backward transformation, we must have T(T b (x, e), e) = x. 
Also, the sequence of forward and backward transformations can be changed to achieve the same effect, 
that is, we must also have T b (T(x, e), e) = x. In the above discussion we indicated the use the same 
e for updating x to x' and for moving back from x' to x. An important advantage associated with this 
strategy is that whatever the choice of the density g(e)I( 0:OO )(e), it will cancel in the acceptance ratio of 
our TMCMC algorithm, resulting in a welcome simplication. 

Thanks to bijection each of the forward and the backward transformations will be equipped with their 
respective inverses. In general, we denote by T(x, e) and T b (x, e) the forward and the backward trans- 
formations, and by T^ 1 (x, e) and T b 1 (x, e) their respective inverses. Note that for fixed e, T^(x, e) = 
T b (x, e), and T b 1 (x, e) = T(x, e), but the general inverses must be defined by eliminating e. For in- 
stance, writing e = x'— x for the forward transformation yields T(x, e) = T(x,x'—x) = x+(x'—x) = x'. 
Defining T _1 (x, x') — x' — x, it then follows that T(x, T _1 (x, x'))= x'=T~ 1 (x, T(x, x')), showing that 
T _1 is the inverse of T in the above sense. Similarly, T h 1 can also be defined. 

2.1.2 Formal set-up 

Suppose T : X x V — > X for some V (possibly a subset of X) is a totally differentiable transformation 
such that 

1. for every fixed e ^ Mi, the transform 9 i — > T(9,e) is bijective and differentiable and that the 
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inverse is also differentiable. 



2. for every fixed x J\f 2 , the transform e i — > T(x, e) is injective. 

where M\ and M 2 are 7r -negligible sets. Further suppose that the Jacobian 

d(T(x,e),e) 



J(x, e) 



d(x, e) 



is non-zero almost everywhere. 

Suppose there is a subset y of V such that Vrr ^ J\f 2 the sets T(x, y) and T b (x, y) are disjoint, where 
T b (x, e) is the backward transformation defined by: 

T (T b (x, e), e) = T b (T(x, e),e)=x 

Example: Transformations on One dimensional state-space 

1. (additive transformation) Suppose X — V — R and T(x,e) = x + e. Let T b (x,e) = x — e. 
This transformation is basically the random walk if e is a random quantity. Notice that if we may 
choose y = (0, oo), then T(x, y) = (x, oo), T b (x, y) = (— oo, x) and we can characterize the 
transformation as a forward move or a backward move according as e e or ^ X Notice that here 
A/" is the empty set and for all x G the map x i — > T(x, e) is a bijection. 

2. (log-additive transformation) Suppose X = V = (0, oo) and T(x, e) = xe. For all rr G X, 

T b (x,e)=x/e.y = (0,l). 

3. (multiplicative transformation) Let X = R = V, T(x, e) = xe. Then Af\ = Af 2 = {0}, for all 
e ^ 0, T 6 (a;, e) = x/e. y = (-1, 1) - {0}. 

Suppose further that g is a density on y and that < p < 1. Then the MCMC algorithm based on 
transformation is given in Algorithm 2.1 



Algorithm 2.1 MCMC algorithm based on transformation (univariate case) 



• Input: Intial value Xq, and number of iterations N. 
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• For t = 0, . . . ,N - 1 

1. Generate e ~ g(-) and u ~ U(0, 1) independently 
2 . If < u < p, set 



(1 — p 71 (x^ \ 
1, — — r J(x,e)] 



3. Else if p < u < 1 set 



rr' = T b (x t , e) and e) = min I 1, 



1 p 7T{x') 1 



1-p 7r(x t ) J(x,e) 



4 . Set 

x' with probability a(x t ,e) 



Xt+l — 



x t with probability 1 - a(x t , e) 



• End for 



Notably, the acceptance probability is independent of the distribution g(-), even if it is not symmetric. 
The algorithm can be shown to be a special case of MH algorithm with the mixture proposal density: 



q(x -> z) =pg(T 1 (x,z)) 



dT-\x,z) 



dz 



+ (l-p)g(T b -\x,z)) 



I(zeT(x,y)) 

dT b ~\x,z) 



dz 



(2.1) 



l(zeT b (x,y)) 



where the inverses are defined by 



1. T(x, T-\x, z)) = z = T-\x, T(x, z)), V z e T(x, y) 

2. T b (x, T b ~\x, z)) = z = T b ~\x, T b (x, z)), V z e T b (x, y) 

Hence detailed balance holds for the above algorithm. This ensures that our TMCMC methodology has 
7r as the stationary distribution. Although in this univariate case TMCMC is an MH algorithm with the 
specialised mixture density (2.1) as the proposal mechanism, this proposal distribution becomes singular 
in general in higher dimensions. 

We remark that TMCMC maybe particularly useful for improving the mixing properties of the 
Markov chain. For instance, if there are distinct modes in several disjoint regions of state space, then 
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standard MH algorithms tend to get trapped in some modal regions, particularly if the proposal distri- 
bution has small variance. Higher variance, on the other hand, may lead to poor acceptance rates in 
standard MH algorithms. Gibbs sampling is perhaps more prone to mixing problems due to the lack of 
tuning facilities. For multimodal target distributions, mixture proposal densities are often recommended. 
For instance, Guan and Krone (2007) theoretically prove that a mixture of two proposal densities results 
in a "rapidly mixing" Markov chain when the target distribution is multimodal. Our proposal, which 
we have shown to be a mixture density in the one-dimensional case, seems to be appropriate from this 
perspective. Indeed, in keeping with this discussion, Dutta (2010), apart from showing that the multi- 
plicative transformation is geometrically ergodic even in situations where the standard proposals fail to 
be so, demonstrated that it is very effective for bimodal distributions. These arguments demonstrate that 
a real advantage of TMCMC (also of other transformation-based methods as in Liu (2001)) comes forth 
when the transformations associated with our method identify a subspace moving within which allows 
to explore regions that are otherwise separated by valleys in the probability function. Efficient choice of 
transformations of course depends upon the target distribution. 

In higher dimensions our proposal does not admit a mixture form but since the principles are similar, 
it is not unreasonable to expect good convergence properties of TMCMC in the cases of high-dimensional 
and/or multimodal target densities. In the multidimensional case, which makes use of multivariate trans- 
formations (which we introduce next), reasonable acceptance rates can also be ensured, in spite of the 
high dimensionality. This we show subsequently in Section 3.2, and illustrate with the Challenger data 
problem and particularly with the geostatistical problem. Moreover, the multivariate transformation 
method brings out other significant advantages of our method, for instance, computational speed and the 
ability to overcome mixing problems caused by highly correlated variables. 

2.2 Multivariate case 

Suppose now that X is a A;-dimensional space of the form X = YT i=1 X i so that T = (T 1? . . . , T k ) 
where each Tj : Xi x V — > Xi, for some set V, are transformations as in Section 2.1. Then for each 
nonempty subset / of {1, ... , k}, let Tf(x, e) be the backward transformations in /-coordinates of the 



map x i — > T(x, e), i.e. T\ = (g u g 2 , . . . , g k ) where 

f n if 




% e I 



i i I 
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and define 7j(x, e) = T(x, e). 

We see that T induces 2 k many types of 'moves' on the state-space. Suppose now that there is 
a subset y of X such that the sets Tj. (x, y) and (x, y) are disjoint for every subsets U ^ Ij of 
{1,...,*} 

Examples: Transformations on higher dimensional state-space 

1. (Additive transformation) Suppose X = V = R 2 , T(x,e) = (x 1 + a\e\,x 2 + a 2 e 2 ) where a x 
and a 2 are two (positive) scale parameters. Then T-f(x,e) = (#i — a\e^x 2 + a 2 e 2 ), ^(x, e) = 
(xi+aiei, X2— 02^2) and 2 |(x, e) = (xi — aiei,x 2 — 02^2)- We may choose y — (0, 00) x (0, 00). 

2. (Multiplicative transformation) Suppose X = V = R x (0, 00), T(x,e) = (xiei, x 2 e 2 ). Then 
T^(x,e) = (x 1 /e 1 ,x 2 e 2 ), T 2 6 (x,e) = (x 1 ei,x 2 /e 2 ) and T { 6 12} (x, e) = (xi/e 1 ,x 2 /e 2 ). We may let 

y = {(-1, 1) - {o}} x (0, 1). 

3. (Additive-multiplicative transformation) Suppose X — V — R x (0, 00), T(x, e) = (xi + ei, x 2 e 2 ). 
ThenT^x.e) = (xj - ei, x 2 e 2 ), T|(x,e) = (xi + ei, x 2 /e 2 ) and T { 6 lj2} (x, e) = (xi - e u x 2 /e 2 ). 
We may let y = (0, 00) x (0, 1). 

The above examples can of course be generalized to arbitrary dimensions. Also, it is clear that it 
is possible to construct valid transformations in high-dimensional spaces using combinations of valid 
transformations on one-dimensional spaces. 

Now suppose that g is a density on y, and let /1, ... , I 2 k be all the subsets of {1, . . . , k} with I± = <fi 
and I 2 k = {1, . . . , k}. Letp(Ji), . . . ,p(/ 2 fe ) be positive numbers summing to I. The MCMC algorithm 
based on transformations is given in Algorithm 2.2. 



Algorithm 2.2 MCMC algorithm based on transformation (multivariate case) 

• Input: Intial value and number of iterations N. 

• For t = 0, . . . , N - 1 

1. Generate e ~ (?(•) and an index i ~ A4(l;pi, . . . ,p 2 k) independently. Actually, 
simulation from the multinomial distribution is not necessary; 
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see Section 3.1 for an efficient and computationally inexpensive 
method of generating the index even when the number of move-types 
far exceeds 2 fc . 



x' = T>(xe),e) and <«»,«)=*(l,f ^ 

3. Set 

At) 



0(7$(x<*>,e),e) 



9(x(*),e) 



x 



(t+1) I x' with probability a(x w ,e) 
x (<) with probability l-«(x w ,e) 



End for 



In light of the above algorithm, it can be seen that for each of the transformations in the above examples, a 
mixture proposal of the form (2.1) is induced. It will, however, be pointed out in Section 3 that a singleton 
e suffices for updating multiple random variables simultaneously, which would imply singularity of the 
underlying proposal distribution. Notice that for arbitrary dimensions the additive transformation reduces 
to the random walk MH (RWMH). 

The detailed balance condition is proved as follows: Suppose y = Tj(x,e) G T|(x, y), then x = 
Tj c (y, e). Hence, the kernel K satisfies, 

7r(x)^(x^y) = tt(x) p(Z) g(e) min { 1, J,(x, e) } 

= <?(e) min{7r(x)p(/),7r(y)p(/ c ) J/(x,e)} 

and 

7r(y)K(y^x) = Tr(y) p(Z) g(e)J,(x, e) min 1 1, -frfr, e) } 

= #(e) min{7r(y)p(/ c )J 7 (x,e),7r(x)p(/)} 

where J/(x, e) = |<9(T 7 b (x, e),e)/9(x,e)| satisfies 

Jj c (7j(x,e),e) x J?(x,e) = 1 since 7%(7?(x,e),e) = x. 

Algorithm 2.2 indicates that updating highly correlated variables can be done naturally with TM- 
CMC: for instance, in Example 1 of this section one may select T(x, e) and 2 j (x, e) with high proba- 
bilties if x x and x 2 are highly positively correlated and T 2 6 (x, e) may be selected with high probability if 
xi and x 2 are highly negatively correlated. 
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3 Validity and usefulness of singleton e in implementing TMCMC 
in high dimensions 



Crucially, a singleton e suffices to ensure the validity of our algorithm, even though many variables are 
to be updated. This indicates a very significant computational advantage over all other MCMC-based 
methods: for instance, complicated simulation of hundreds of thousands of variables may be needed for 
any MCMC-based method, while, for the same problem, a single simulation of our methodology will do. 
Indeed, in Section 7 we update 160 variables using a single e in the geostatistical problem of Diggle etal. 
(1998). This singleton e also ensures that a mixture MH proposal density corresponding to our TMCMC 
method does not exist. The last fact shows that TMCMC can not be a special case of the MH algorithm. 
On the other hand, assuming that instead of singleton e, there is an e, associated with each of the variables 
Xi, % = 1, . . . , k, then again TMCMC boils down to the MH algorithm, and, as in the univariate case, 
here also our transformations would induce a mixture proposal distribution for the algorithm, consisting 
of 2 k mixture components each corresponding to a multivariate transformation. 

Using singleton e, for transformations other than the additive transformation, it is necessary to in- 
corporate extra move types having positive probability which change one variable using forward or 
backward transformation, keeping the other variables fixed at their current values. Consider for instance, 
Example 3 of Section 2.2. The example indicates that, with a singleton e, it is only possible to move from 
(xi,x 2 ) to either of the following states: (xi + e, x 2 e), (x± — e, x 2 e), (xi + e, x 2 /e) and (xi — e, x 2 /e) with 
positive probabilities. In addition, we could specify that the states (x 1: x 2 e), (x 1: x 2 /e), (xi + e, x 2 ) and 
(xi — e, x 2 ) also have positive probabilities to be visited from (xi,x 2 ) in one step. We will need to specify 
the visiting probabilities pi > 0;i = 1, . . . , 8 such that Yli=iPi = 1- A general method of specifying 
the move-type probabilities, which also preserves computational efficiency, is dicussed in Section 3.1. 
Inclusion of the extra move types ensures irreducibility and aperiodicity (the definitions are provided in 
the Appendix) of the Markov chain. It is easy to see that even for higher dimensions irreducibility and 
aperiodicity can be enforced by bringing in move types of similar forms that updates one variable keep- 
ing the remaining variables fixed. One only needs to bear in mind that the move types must be included 
in pairs, that is, a move type that updates only the i-th co-ordinate X; L using forward transformation and 
the conjugate move type that updates only Xi using the backward transformation both must have positive 
probability of selection. 
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This strategy works for all transformations, including the examples in Section 2.2 where we now 
assume equality of all the components of e. Only additional move types are involved for transformations 
in general. However, we prove in the Appendix that the additive transformation does not require the 
additional move types. Also taking account of the inherent simplicity of this transformation, this is our 
automatic choice for the applications reported in this paper. 

3.1 Flexible and computationally efficient specification of the move-type proba- 
bilities 

Consider a k (> 1) -dimensional target distribution, with associated random variables x = (x±, . . . , Xk). 
Then, in order to specify the move-type probabilities, we can implement the following simple rule. 
Given x, let the forward and the backward transformations be applied to Xi with probabilities pi and g«, 
respectively. With probability 1—pi—qi, X{ remains unchanged. For computational convenience, one may 
define a random variable z% that takes values —1, 0, 1, with probabilities q u 1 — — qi,Pi, respectively. 
The values —1,0, 1 corresponds to backward transformation, no change, and forward transformation, 
respectively. 

This rule is to be applied to each of i = 1, . . . , k coordinates. This rule then includes all possible 
move types, including the one where none of the x,- L is updated, that is, x is taken to x. Since the move- 
type x i — y x is redundant, this is to be rejected whenever it appears. In other words, we would keep 
simulating the discrete random vector (zi, . . . , Zk) until at least one Zi ^ 0, and would then select the 
corresponding move type. For any dimension, this is a particularly simple and computationally efficient 
exercise, since the rejection region is a singleton, and has very small probability (particularly in high 
dimensions) if either of pi and qt is high for at least one i. 

The above method implies that the probability of a move-type is of the form c ri^eSx Ph Y[i 2 &s 2 9»2 lL 3 6S-i — 
Pis ~ ?i 3 )' where Si U S 2 U S 3 = {1,2, ... ,k} and c is the normalizing constant, which arose due to 
rejection of the move type x4x. This normalizing constant cancels in the acceptance ratio, and so it is 
not required to calculate it explicitly, another instance of preservation of computational efficiency. 

For the additive transformation, the issues are further simplified. The random variable Zi here takes 
the value —1 and 1 with probabilities p,i and q i = l — p h respectively. So, only p, L needs to be specified. 
Since zi = has probability zero in this setup, there is no need to perform rejection sampling to reject 
any move-type. 
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Interestingly, the ideas developed in this section provides us with a handle to control the move-type 
probabilities, by simply controlling pi and for each i. For instance, if some pilot MCMC analysis tells 
us that Xi and xj are highly positively correlated, then we could set and pj (or and qj) to be high. 
On the other hand, if x { and xj are highly negatively correlated, then we can set p, L to be high (low) and 
qj to be low (high). 

3.2 Improved acceptance rates of additive TMCMC with singleton e compared 
to joint updating using RWMH 

Consider a continuous target density of k > 1 random variables, denoted 7r(x), where x = (x± : . . . , x*;). 
Assume further that 7r(-) is uniformly continuous function of x. The joint random walk MH algorithm 
generates e = (ei, . . . , e^)' independently from iV(0, 1), and then uses the transformation x\ = Xi + a^; 
we assume that > K for each i. Thus, the random walk MH updates (x±, . . . , xu) simultaneously in 
a single block. On the other hand, the additive-transformation based TMCMC also updates (xi, . . . , x^) 
simulatenously in a single block, but instead of using k different e h it uses a single e for updating 
all the Xi variables. In other words, for TMCMC based on additive transformation e is of the form 
e = (±e, . . . , ±e)', where e ~ N(0, l)I {e>0} . 

Finding the exact or asymptotic acceptance rate for a random walk MH algorithm for a general 
target density is still an unsolved problem in MCMC literature. In this article we try to give reasonable 
upper bounds to the acceptance rate of the additive TMCMC with singleton e and the random walk MH 
algorithm and show how the acceptance rate for the latter converges to zero faster than that for the former. 

Now, if R(x' | x) is the acceptance probability of x' given the current value x, then, for every 
r G (0, 1), due to the assumptions regarding the target density n(-), 

Pr(R(x' | x) < r) = Pr(\n(x.') - tt(x)| > Ci(r),7r(x / ) < vr(x)) 

< Pr (|| x' - x ||> c 2 (r), tt(x') < tt(x)) (3.1) 
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Now, for any positive constants c and e , 

Pr (|| x' -x ||< c,tt(x') < ?r(x)) < Pr (|| x' - x ||< c) 



<ME^<c7^ 2 



,i=l 



< $ 



{c 2 /K 2 ) - k 



2k 



+ e , for k > k (e ), 



(3.2) 



$(■) being the distribution function of N(0, 1) distribution. Hence, for any e > 0, 

/ k 

Pr (RWMH) (i?(x / | x ) < r ) >l-Pr[J2e 2 < c 2 2 (r)/K 2 



j=i 



> 1 - $ 



[c 2 /K 2 ) - k 



2k 



- e , for k > k (e ). 



(3.3) 



On the other hand, for additive TMCMC with singleton e ~ JV(0, l)I(e > 0), 

Pr (|| x' - x ||< c) < Pr (e < c/VkK^j 



= 2$ 



1. 



This implies that even in TMCMC with singleton e, for any r e (0, 1) it holds that 



p r (TMCMC) (jR(x , | x) < r) > 2 



1 - $ 



c 2 (r) 



(3.4) 



(3.5) 



Inequalities (3.3) and (3.5) show that under both RWMH and TMCMC, the acceptance probabilities are 
small with probability tending to 1 as the dimension k — >■ oo. However, under RWMH this goes to 
zero at a much faster rate than that under TMCMC. This is clear because the ratio of the argument of 
the increasing distribution function $(•) in (3.2) to that in (3.4) is — k*j which goes to — oo as 
k — >■ oo. Now, letting U ~ Uniform(0, 1), the acceptance rate is given by 

AR = J i?(x'|x)g(x / |x)7r(x)rfxrfx / 

= J J Pr{U < i?(x / |x))g(x / |x)rfx / 7r(x)rfx 

= J J Pr( J R(x'|x) > u)du 7r(x)rfx (3.6) 
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Since Pr (i?(x'|x) > u) is bounded above by 1, which is integrable in this set up, the dominated con- 
vergence theorem holds, showing that AR — >■ as k — > oo. In fact, for large k, (3.3) implies that the 
following inequality holds in the case of RWMH: 



AR (RWMH) < 



n A(u)l^)-k ]+fo)iu 



vr(x)rfx < sup $ v 2V ;/ ; + e , 

ue(o,i) V v2/c / 

(3.7) 



and, (3.5) implies that for any k, the following inequality holds for TMCMC: 

, J y> {» (^) _ !} *,] < s> {» (^) - 1} ■ (3.8) 

Comparison of the upper bounds in (3.7) and (3.8) shows that for large k, additive TMCMC will have a 
much higher upper acceptance rate as compared to that of joint RWMH. 

Standard methods like sequential RWMH may tend to be computationally infeasible in high dimen- 
sions while inducing mixing problems due to posterior dependence between the parameters, whereas 
TMCMC remains free from the aforementioned problems thanks to singleton e and joint updating of all 
the parameters. Specialised proposals for joint updating may be constructed for specific problems only, 
for instance, block updating proposals for Gaussian Markov random fields are available (Rue (2001)). 
But generally, efficient block updating proposals are not available. Moreover, even in the specific prob- 
lems, simulation from the specialized block proposals and calculating the resulting acceptance ratio are 
generally computationally very expensive. In contrast, TMCMC with singleton e seems to be much 
more general and efficient. Moreover, we demonstrate in Section 6 in connection with the Challenger 
data problem that TMCMC can outperform well-established block proposal mechanisms, usually based 
on the asymptotic covariance matrix of the maximum likelihood estimator (MLE), in terms of acceptance 
rate. But before illustrating TMCMC with real examples, we first investigate the relationship of HMC, 
another specialized MCMC method based on deterministic updating proposal, with TMCMC. 



4 Comparison of TMCMC with HMC 

Motivated by Hamiltonian dynamics, Duane et al. (1987) introduced HMC, an MCMC algorithm with 
deterministic proposals based on approximations of the Hamiltonian equations. We will show that this 
algorithm is a special case of TMCMC, but first we provide a brief overview of HMC. More details can 
be found in Liu (2001), Cheung (2009) and the references therein. 
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4.1 Overview of HMC 

If 7r(x) is the target distribution, a fictitious dynamical system may be considered, where x(t) e R d can 
be thought of as the rf-dimensional position vector of a body of particles at time t. If v(t) = x(t) = ^ 
is the speed vector of the particles, v(t) — ^ is its acceleration vector, and F is the force exerted on the 
particle; then, by Newton's law of motion F = mv(i) = (miVi, . . . , m d v d )(t), where m G l d is a mass 
vector. The momentum vector, p = mv, often used in classical mechanics, can be thought of as a vector 
of auxiliary variables brought in to facilitate simulation from 7r(x). The kinetic energy of the system is 
defined as W(p) = p'M _1 p, M being the mass matrix. Usually, M is taken as M = diag{mi, . . . , m d }. 

The target density 7r(x) is linked to the dynamical system via the potential energy field of the system, 
defined as U(x) = — log 7r(x). The total energy (Hamiltonian function), is given by H (x, p) = U (x) + 
W(p). A joint distribution over the phase-space (x, p) is then considered, given by 

/(x,p) oc exp{-#(x,p)} = 7r(x)exp (-p'M^p/S) (4.1) 

Since the marginal density of /(x, p) is 7r(x), it now remains to provide a joint proposal mechanism for 
simulating (x, p) jointly; ignoring p yields x marginally from n(-). 

For the joint proposal mechanism, HMC makes use of Newton's law of motion, derived from the law 
of conservation of energy, and often written in the form of Hamiltonian equations, given by 

m = _afl^p) = _ w(x) _ 

where VU (x) = 9U d ^ ■ The Hamiltonian equations can be approximated by the commonly used leap- 
frog algorithm (Hockney (1970)), given by, 

x(t + 6t) = x(t) + StM^ 1 |p(t) - ^Vf/(x(t))j (4.2) 

p(t + 5t) = p(t) - ^ {VU (x(t)) + VU (x(t + 5t))} (4.3) 

Given choices of M, St, and L, the HMC is then given by the following algorthm: 
Algorithm 4.1 HMC 

• Initialise x and draw p ~ N(0, M) . 
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• Assuming the current state to be (x,p), do the following: 

1. Generate p' ~ iV (0, M) ; 

2. Letting (x(0), p(0)) = (x, p') , run the leap-frog algorithm for L time 
steps, to yield (x", p") = (x(t + L8t), p(f + L5ij) ; 

3. Accept (x",p") with probability 

min {1, exp {-if (X", p") + H (x, p')}} , (4.4) 
and accept (x, p') with the remaining probability. 

In the above algorithm, it is not required to store simulations of p. Next we show that HMC is a special 
case of TMCMC. 

4.2 HMC is a special case of TMCMC 

To see that HMC is a special case of TMCMC, note that the leap-frog step of the HMC algorthm (Al- 
gorithm 4.1) is actually a deterministic transformation of the form g L : (x(0),p(0)) — > (x(L),p(L)) 
(see Liu (2001)). This transformation satisfies the following: if (x',p') = g L (x, p), then (x, — p) = 
S L (x',-p'). 

The Jacobian of this transformation is 1 because of the volume preservation property, which says 
thatifV(O) is a subset of the phase space, and if V(t) = {(x(i),p(t)) : (x(0),p(0)) G V(0)}, then the 
volume \V(t)\ = J J v ^dxdp = J J v ^dx.dp = |V(0)|. As a result, the Jacobian does not feature in 
the HMC acceptance probability (4.4). 

For any dimension, there is only one move type defined for HMC, which is the forward transfor- 
mation g L . Hence, this move type has probability one of selection, and all other move types which we 
defined in general terms in connection with TMCMC, have zero probability of selection. As a result, 
the corresponding TMCMC acceptance ratio needs slight modification — it must be made free of the 
move-type probabilities, which is exactly the case in (4.4). 

The momentum vector p can be likened to e of TMCMC, but note that p must always be of the same 
dimensionality as x; this is of course, permitted by TMCMC as a special case. 
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4.3 Comparison of acceptance rate for L = 1 with RWMH and TMCMC 



For L — 1, the proposal corresponding to HMC is given by (see Cheung (2009)) 

g(x'|x(i)) = N(x':/i(i),E(f)), (4.5) 

where (4.5) is a normal distribution with mean and variance given, respectively, by the following: 

fi(t) = x(t) + ^M-MfV log (tt(x(0)) (4.6) 

£(t) = 5tM _1 (4.7) 

Assuming diagonal M with rrii being the i-th diagonal element, the proposal can be re-written in the 
following more convenient manner: for i — 1, . . . , k, 

x\ = Xi (t) + €i, (4.8) 

where Si(t) denotes the i-th component of V log (7r(x(t))), and ~ N {\ &ts r ^ t \ ■ Assuming, as is 
usual, that = 1 for each i, it follows that 

II / 112 k , , o k 



= £(!) =E f ?~^(A), (4.9) 



« 2 

«=1 i=l 



where xlW is a non-central x 2 distribution with k degrees of freedom and non-centrality parameter 

^ = T~ Yli=i s i(t)- Since, as either k — > oo or A — >■ oo, 



V / 2(A; + 2A) 

it follows that for any positive constants c and e , 



A7V(0,1), (4.10) 



Pr (|| x'-x ||< c,7r(x') < ?r(x)) < Pr (|| x - x ||< c) 

\i=i 



^' c " 2 y )|+f °- forfcaw (4 - u) 



Comparing with (3.2) it follows that the ratio of in (4.1 1) to c2/ ^~ fc is 



! c 2 /5* 2 -l-- 



-oo if | — >■ oo. Thus, compared to (3.2), (4. 1 1) goes to zero at a much faster rate. 
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Hence, as in (3.3), using the same assumptions as in Section 3.2, 



Pr (HMC) (jR(x , | x ) < r) > 1 - Pr ( ^ ef < U 1 as fc^ oo. 



j=i 



(4.12) 



It follows that for k > k (e ), 



AR (HMC) < 



(cl(^)/^t 2 ) -A- A 
a/2(A + 2A) 



+ e > du 



7r(x)dx < sup $ 

ue(o,i) 



(c|H/^ 2 ) - A; -A 
v/2(A + 2A) 

(4.13) 



By the above arguments, if A/A; — )• oo as k — > oo, then AR( HMC ^ tends to at a rate much faster than 
AR( RWMH \ while AR^™ CMC "> goes to zero at the slowest rate. 



5 Generalized Gibbs/Metropolis approaches and comparisons with 
TMCMC 

It is important to make it clear at the outset of this discussion that the goals of TMCMC and gener- 
alized Gibbs/Metropolis methods are different, even though both use moves based on transformations. 
While the strength of the latter lies in improving mixing of the standard Gibbs/MH algorithms by adding 
transformation-based steps to the underlying collection of usual Gibbs/MH steps, TMCMC is an alto- 
gether general method of simulating from the target distribution which does not require any underlying 
step of Gibbs or MH. 

The generalized Gibbs/MH methods work in the following manner. Suppose that an underlying 
Gibbs or MH algorithm for exploring a target distribution has poor mixing properties. Then in order to 
improve mixing, one may consider some suitable transformation of the random variables being updated 
such that mixing is improved under the transformation. Such a transformation needs to chosen carefully 
since it is important to ensure that invariance of the Markov chain is preserved under the transformation. 
It is convenient to begin with an overview of the generalized Gibbs method with a sequential updating 
scheme and then proceed to the discussion on the issues and the importance of the block updating idea 
in the context of improving mixing of standard Gibbs/MH methods. 

Liu and Sabatti (2000) (see also Liu and Yu (1999)) propose simulation of a transformation from 
some appropriate probability distribution, and then applying the transformation to the co-ordinate to be 
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updated. For example, in a d-dimensional target distribution, for updating x = (27, x 2 , . . . , x d ) to x' = 
(x' 1 ,x 2 , ■ ■ ■ ,x d ), using an additive transformation, one can select e from some appropriate distribution 
and set x[ — 27 + e. Similarly, if a scale transformation is desired, then one can set x[ = 727, where 
7 must be sampled from some suitable distribution. The suitable distributions of e and 7 are chosen 
such that the target distribution is invariant with respect to the move x', the forms of which are provided 
in Liu and Sabatti (2000). For instance, if ir(-) denotes the target distribution, then for the additive 
transformation, e may be sampled from ir(x 1 + e, 27, ■ ■ ■ , x d ), and for the multiplicative transformation, 
one may sample 7 from |7|7r(727, 27, • • • , x d ). Since direct sampling from such distributions may be 
impossible, Liu and Sabatti (2000) suggest a Metropolis -type move with respect to a transformation- 
invariant transition kernel. 

Thus, in the generalized Gibbs method, sequentially all the variables must be updated, unlike TM- 
CMC, where all the variables can be updated simultaneously in a single block. Here we note that for 
irreducibility issues the generalized Gibbs approach is not suitable for updating the variables blockwise 
using some transformation that acts on all the variables in a given block. To consider a simple exam- 
ple, with say, d = 2 and a single block consisting of both the variables, if one considers the additive 
transformation, then starting with x = (27,27), where 27 < x 2 , one can not ever reach x' = (x' 1: x 2 ), 
where x[ > 0, x 2 < 0. This is because x[ = 27 + z and x' 2 = x 2 + z, for some z, and x[ > 0, x' 2 < 
implies z > —27 and z < —x 2 , which is a contradiction. The scale transformation implies the move 
x = (27, . . . ,x d ) — > (7x1, . . . ,72^) = x'. If one initialises the Markov chain with all components 
positive, for instance, then in every iteration, all the variables will have the same sign. The spaces where 
some variables are positive and some negative will never be visited, even if those spaces have positive (in 
fact, high) probabilities under the target distribution. This shows that the Markov chain is not irreducible. 
In fact, with the aforementoned approach, no transformation, whatever distribution they are generated 
from, can guarantee irreducibility in general if blockwise updates using the transformation strategy of 
generalized Gibbs is used. 

Although blockwise transformations are proposed in Liu and Sabatti (2000) (see also Kou, Xie and 
Liu (2005) who propose a MH-based rule for blockwise transformation), they are meant for a different 
purpose than that discussed above. The strength of such blockwise transformations lies in improving 
the mixing behaviour of standard Gibbs or MH algorithms. Suppose that an underlying Gibbs or MH 
algorithm for exploring a target distribution has poor mixing properties. Then in order to improve mix- 
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ing, one may consider some suitable transformation of the set of random variables being updated such 
that mixing is improved under the transformation. This additional step involving transformation of the 
block of random variables can be obtained by selecting a transformation from the appropriate probability 
distribution provided in Liu and Sabatti (2000). This "appropriate" probability distribution guarantees 
that stationarity of the transformed block of random variables is preserved. Examples reported in Liu 
and Sabatti (2000), Muller (2005), Kou, Xie and Liu (2005), etc. demonstrate that this transformation 
also improves the mixing behaviour of the chain, as desired. 

Thus, to improve mixing using the methods of Liu and Sabatti (2000) or Kou, Xie and Liu (2005) one 
needs to run the usual Gibbs/MH steps, with an additional step involving transformations as discussed 
above. This additional step induces more computational burden compared to the standard Gibbs/MH 
steps, but improved mixing may compensate for the extra computational labour. In very high dimensions, 
of course, this need not be a convenient approach since computational complexity usually makes standard 
Gibbs/MH approaches infeasible. Since the additional transformation-based step works on the samples 
generated by standard Gibbs/MH, impracticality of the latter implies that the extra transformation-based 
step of Liu and Sabatti (2000) for improving mixing is of little value in such cases. 

It is important to point out that the generalized Gibbs/MH methods can be usefully employed by even 
TMCMC to further improve its mixing properties. In other words, a step of generalized Gibbs/MH can 
be added to the computational fast TMCMC. This additional step can significantly improve the mixing 
properties of TMCMC. That TMCMC is much faster computationally than standard Gibbs/MH methods 
imply that even in very high-dimensional situations the generalized Gibbs/MH step can ve very much 
successful while working in conjunction with TMCMC. 

In the next section we illustrate implementation of TMCMC with singleton e using the much-studied 
Challenger data. 

6 Application of TMCMC to the Challenger dataset 

In 1986, the space shuttle Challenger exploded during take off, killing the seven astronauts aboard. The 
explosion was the result of an O-ring failure, a splitting of a ring of rubber that seals the parts of the 
ship together. The accident was believed to be caused by the unusually cold weather (31°F or 0°C) at the 
time of launch, as there is reason to believe that the O-ring failure probabilities increase as temperature 
decreases. The data are provided in Table 6.1 for ready reference. We shall analyze the data with the 
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help of well-known logit model. Our main aim is not analyzing and drawing inference since it is done 
already in Dalai et al. (1989), Martz and Zimmer (1992) and Robert and Casella (2004) . We shall rather 
compare the different MCMC methodologies used in Bayesian inference for logit-model. 



Flight no. 


Failure 


Temp 


Flight no. 


Failure 


Temp 


14 


1 


53 


2 


1 


70 


9 


1 


57 


11 


1 


70 


23 


1 


58 


6 





72 


10 


1 


63 


7 





73 


1 





66 


16 





75 


5 





67 


21 


1 


75 


13 





67 


19 





76 


15 





67 


22 





76 


4 





68 


12 





78 


3 





69 


20 





79 


8 





70 


18 





81 


17 





70 









Table 6.1: Challenger data. Temperature at flight time (degrees F) and failure of O-rings (1 stands for 
failure, for success). 

Let 

Vi = A + foXi 

where Xi = U/ max U, U's being the temperature at flight time (degrees F), % — 1, . . . , n. and n = 23. 
Also suppose i/i is the indicator variable denoting failure of 0-ring. We suppose y^'s independently follow 
Bernoulli(7r(a;j)). 

In the logit model we suppose that the log-odd ratio is a linear function of temperature at flight time, 

i.e., 

log — ^— = r] = /3 1 + /3 2 x 

1 — 7T 

which gives 

7ii = exp(7fc)/(l + exp(77i)) 
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We construct an appropriate additive transformation T : IR 2 x R — > M 2 as follows. First, we consider the 
form T((/?i, p 2 ), e) = (pi, P2)' + LB(e, e)', where L is a diagonal matrix with tuning parameters on its 
diagonal, and B is a 2 x 1 vector, obtained using Cholesky decomposition, is such that BB' C, the 
large sample covariance matrix of the maximum likelihood estimator of (Pi, p 2 )' . So the entries of B are 
approximate variances of the m.l.e of (Pi, Pi). For our purpose, we set L equal to the identity matrix. 
Thus, we finally obtain the transformation 

T((pi, p 2 ), e) = (pi + 7.3773e, p 2 + 4.3227e) 

and use algorithm 2.2 with y — (0, 00) and 

g(e) oc exp(-e 2 /2), e > 

i.e. the N(0, 1) distribution truncated to the left at zero. From the covariance matrix C we observe that 
the correlation of Pi and /3 2 is approximately —0.99 and hence from our discussion towards the end of 
Section 2.2 setting high probabilities to the moves T^(x, e) and Tf(x, e) should facilitate good mixing. 
For our purpose we setp(</>) — p({l,2}) = 0.01 and p({l}) = p({2}) = 0.49. 

Also for comparison we use the RWMH algorithm (both joint and sequential updation) and also the 
MH algorithm with proposal q(P'\P) = N(0, S) where £ = h 2 C (we take h = 1 for our purpose) with 
C being the large sample covariance matrix of the MLE P of 0. Table 6.2 gives the posterior summaries 
and Figure 6.1 gives the traceplots of Pi and p 2 for TMCMC sampler and the MH sampler. It is seen that 
the mixing is excellent even though a single e has been used. 

Notice the excellent result of the MCMC based on transformations. The acceptance ratio is almost 
twice as large as those for other two MH algorithms. As remarked towards the end of Section 3.2, indeed 
TMCMC outperformed the MH block proposal based on the large sample covariance matrix of the MLE 
of P in terms of acceptance rate. Also for implementing TMCMC we need to simulate only one e in each 
step. In the RWMH with sequential updation and in MH based on bivariate normal proposal we need 
two such e's. In the RWMH we need to calculate the likelihood twice in each iteration. So, TMCMC 
dominates the other two in this respect. It can be easily anticipated, in light of the theoretical arguments 
in Section 3.2, that for joint RWMH the acceptance rate would be even lower. In Section 7, where 
we consider a 160-dimensional problem, we show as indicated by the calculations in Section 3.2, that 
TMCMC outperforms joint RWMH by a substantially large margin in terms of acceptance rate. 
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variable 


method 


aeeeptanee rate 

(%) 


mean 


std 


2.5%* 


25%* 


50%* 


75%* 


97.5%* 


01 


RWMH 


42.17 


19.119 


8.078 


4.909 


13.481 


18.475 


24.227 


38.176 


MH 


42.60 


18.930 


8.513 


5.011 


12.823 


17.981 


23.957 


38.206 


TMCMC 


73.23 


18.973 


7.944 


4.970 


12.881 


16.210 


21.685 


37.877 


02 


RWMH 


48.14 


-23.724 


9.613 


-46.272 


-29.786 


-22.984 


-17.019 


-6.7792 


MH 


42.60** 


-23.491 


10.128 


-46.461 


-29.464 


-22.353 


-16.261 


-6.956 


TMCMC 


73.23** 


-23.165 


9.762 


-46.404 


-28.891 


-22.282 


-16.446 


-7.026 



Table 6.2: Summary of the posterior samples based on MCMC runs of length 100,000 out of which first 20,000 
samples are discarded as burn-in. 

RWMH = Random walk Metropolis-Hastings, MH = Metropolis-Hastings with bivariate normal proposal, TM- 
CMC = MCMC based on transformation 
* : posterior sample quantiles. 

**: same as acceptance ratio for /3i since updated jointly. 

7 Application of TMCMC to the geostatistical problem of radionu- 
clide concentrations on Rongelap Island 

7.1 Model and prior description 

We now consider the much analysed radionuclide count data on Rongelap Island (see, for example, 
Diggle et al. (1997), Diggle et al. (1998), Christensen (2004), Christensen et al. (2006)), and illustrate 
the performance of TMCMC with a singleton e. For % — 1 , . . . , 157, Diggle et al. ( 1 998) model the count 
data as 

Yi ~ Poisson(Mi), 

where 

Mi = t i exp{p + S{x i )}; 

U is the duration of observation at location Xj, f3 is an unknown parameter and S(-) is a zero-mean 
Gaussian process with isotropic covariance function of the form 

Cov (S(z 1 ), S(z 2 )) = a 2 exp{- (a || z x - z 2 \\) s } 

for any two locations Zi, z 2 . In the above, || ■ || denotes the Euclidean distance between two locations, 
and (<r 2 , a, 5) are unknown parameters. Typically in the literature 5 is set equal to 1 (see, e. g. Christensen 
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et al. (2006)), which we adopt. We assume uniform priors on the entire parameter space corresponding 

to (/3,log(<x 2 ),log(a)). 

We remark that since the Gaussian process S(-) does not define a Markov random field, the block 
updating proposal developed by Rue (2001) is not directly applicable here. Rue et al. (2009) attempt to 
develop deterministic approximations to latent Gaussian models, but the scope of such approximations 
is considerably restricted by the conditional independence (Gaussian Markov random field) assumption 
(Banerjee (2009)). Thanks to the generality and efficiency of our proposed methodology, it seems most 
appropriate to fit the Rongelap island model using TMCMC with singleton e. 

7.2 Results of additive TMCMC with singleton e 

Drawing e ~ iV(0, l)I(e > 0), we considered the following additive transformation 

T((3,e)=(3±2e, 
T(log(a 2 ), e ) = log(a 2 )±5 e , 
T(log(a),e) = log(a)±5e, 
T(S(xi), e) = S(xi) ± 2e; for i = 1, . . . , 157 

The scaling factors associated with e in each of the transformations are chosen on a trial- and-error basis 
after experimenting with several initial (pilot) runs of TMCMC. We assigned equal probabilities to all 
the 2 160 move types. Move types are selected by simply generating '+' or ' — ' with equal probabilities 
and plugging in the sign in each of the 160 individual transformations. 

After discarding the first 2 x 10 7 iterations as burn-in, we stored 1 in every 100 iterations in the next 
3.5 x 10 7 iterations. This entire simulation took about a week to run on an ordinary laptop machine and 
about 3 days on a workstation. The autocorrelation functions of the variables (after further thinning by 
10) of our TMCMC run, displayed in Figure 7.1, indicates reasonable mixing properties. The acceptance 
rate, after discarding the burn-in period, is 0.43% (considering the complete run of TMCMC after burn- 
in, that is, including thinning as well). 

7.3 Comparison with joint RWMH 

We also implemented a joint RWMH using the same additive transformation as in Section 7.2 but with 
different e's for each unknown. Now the acceptane rate reduced to 0.0005%. These observations are 
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Figure 7.1: Autocorrelation plots of the variables, ct,/3, log a 2 (left-panel) and si, . . . , S157 (right-panel) 
in the TMCMC run. 

exactly in keeping with the theoretical discussion presented in Section 3.2. In fact, referring to the 
calculations presented in that section, note that, with K — 2, c = 0.1, and k = 160, $ ^^~ k ^j 
1.874419 x 10~ 19 , which corresponds to RWMH, while 2$ - 1 « 0.003, corresponding to 

TMCMC. 



8 Application of TMCMC to doubly-intractable problem 

Doubly-intractable distributions arise quite frequently in fields like circular statistics, directed graphical 
models, Markov point processes etc. Even some standard distributions like gamma and beta involve 
untractable normalizing constants. Formally, a density h(y\6) of the data set y = . . . , y n )' is said to 
be doubly-intractable if it is of the form 

h(y\e) = f(y\e)/Z(0) 

where Z(9) is a function that is not available in closed form. So if we put a prior it (9) on 9, then the 
posterior is given by 

^w^W^ where c(y) = /w T( " )<i9 



Thus, if we try to apply MH like algorithms then the acceptance ratio will involve ratio of the function 
Z(-) at two paramter points 9 and 9'. Hence directly applying MH may not be feasible. Works by M0ller 
et al. (2004) and Murray et al. (2006) are significant in this field. A double MH sampler approach is 
taken in Liang (2010). In this section we briefly discuss the bridge-exchange algorithm by Murray et al. 
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(2006) and show how our application of TMCMC in the bridge-exchange algorithm may facilitate fast 
computation. 

Suppose M E N is the bridge size, f3 m = mj (M +1), m — 0, . . . , M. Define the density 

p m (x|0, 9') oc f{x\ey™f{x\e') 1 -^ = / m (x|0, 9'), m — 0, . . . ,M. 

Obviously, x is of the same dimensionality as y; that is, x = (x±, . . . , x n )'. Further suppose that for each 
m, T m (x — > x.'\9, 9') is a kernel satisfying the detailed balance condition 

T m (x -> x'|0,0>m(x|M') = T m (x' -> x|0,0>m(x'|0,0')- 

Now with a proposal density q(9 — > 0'|y) for the parameter, the bridge-exchange algorithm is given 
below. 

Algorithm 8.1 The bridge-exchange algorithm 

• Input: initial state 9 0r length of the chain N, #bridge 
levels M. 

• For t = 0,...,N -1 

1. Propose 0' ~ q(9' «- 9 t \y) 

2. Generate an auxiliary variable with exact sampling: 

x ~po(xo|MO = /(xoK)/^) 

3. Generate M further auxiliary variables with transition 
operators : 

Xl ~ T^xo^x^,^) 
x 2 ~ T 2 ( Xl ^x 2 |0,0') 

xm ~ Tm(x m _i ->■ x M |0,0') 



4 . Compute 

1 <J ? (0^ 0'|y)7r(0)/(y|0) JU / m (x m |0,0') 
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Set 

f 9' with probability a(9' <- 9 t ) 
9 t with probability 1 - a(9' <- t t ) 



9t+i — 



• end for 



Now we see that, since each of the auxiliary variables x m , m = 1, . . . , M, is n-dimensional, gen- 
eration of these auxiliary variables may be computationally demanding if the sample size n is moderate 
or large especially when one has to simulate from the sample space using accept-reject algorithms as in 
the case of circular variables. For any kernel T m which is not based on TMCMC, 0(nM) variables are 
required to be generated from the state-space per iteration. Appealing to TMCMC, recall that with the 
additive transformation with a single e, the kernel still satisfies the detailed balance condition. 

We assume that X is a group under some binary operation and that there is a homomorphism from 
(IR P , +) to X for some p E N. So we denote the binary operation on X by '+' itself. Let g be a density 
on X. We construct the kernels T m as follows: 



Algorithm 8.2 Construction ofT„ 



1. Generate e ~ g(e) and a subset J of {1,2, ...,n} 

2. define the vector x' by 



x' = 



x m -i,i + if i i J 
x m -i,i - die if i e J 



3 . Set a(x m _ 1 — > x') = min 

4 . Set 



f p(J c ) f m (x'\9,9>) 
\p(J) f m (x\9,9>) ' 



x' with probability a(x m _i — > x') 
x m _! with probability 1 — a(x m _! — > x') 



In this way we need only O(M) simulations per iteration. Homomorphism from (W, +) to X holds 
in many cases, for example, in circular models where the state-space is (— n, ir] is a group with respect 
to addition modulo ir. 
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Figure 8.1: Left panel: Traceplot of last 1,000 samples. Right panel: exact posterior density of v (solid 
line) and it's estimate (dash-dotted line). 

8.1 Simulation study to illustrate TMCMC in bridge-exchange algorithm 

Here we illustrate our method for a circular model of the form 

Kv\ v ) = ^rT ex P( cos (l/ + fsin(y))), - 7T < y, v < ir, 

We generate a sample of size 20 from h(y\u = 0) and estimate the parameter v based on this sample. The 
prior chosen on v is the uniform distribution on (— ir, n] and g(-) is chosen to be the normal distribution 
with mean and variance 1 restricted on the set (0, n}. Since the components of x are iid, we used 
p(I) = l/2 n for each subset / of {1, 2, . . . , n} and a, L = 1 for each i. We set M = 100 and chose q{y'\v) 
to be the Von-mises distribution with mean v and concentration 0.5 to keep the acceptance level around 
63%. 

The right panel of Figure 8.1 shows that the estimated posterior density of v is very close to the exact 
posterior density. The little discrepancy at the tails are due to the fact that v is a circular variable and 
hence its support is (— 7r, it] and the density is not zero at the end points - a fact that is not incorporated in 
the kernel density estimator. The left panel of the same figure shows that the mixing is excellent. Notice 
that here we have saved 100(nM — M)/nM = 95% simulations. 
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9 Summary, conclusions and future work 



In this paper we have proposed a novel MCMC method that uses deterministic transformations and 
move types to update the Markov chain. We have shown that our algorithm TMCMC generalizes the 
MH algorithm boiling down to MH with a specialized proposal density in one-dimensional cases. For 
higher dimensions if each component x { of the random vector to be updated is associated with a distinct 
€i, then TMCMC again boils down to the MH algorithm with a specialised proposal density. But in 
dimensions greater than one, with less number of distinct than the size of the random vector to be 
updated, TMCMC does not admit any MH representation. That HMC is also a special case of TMCMC, 
is also explained. We also contrasted TMCMC with the transformation-based methods of Liu and Yu 
(1999), Liu and Sabatti (2000), and Kou, Xie and Liu (2005). 

The advantages of TMCMC are more prominent in high dimensions, where simulating a single ran- 
dom variable can update many parameters at the same time, thus saving a lot of computing resources. 
That many variables can be updated in a single block without compromising much on the acceptance 
rate, seems to be another quite substantial advantage provided by our algorithm. We illustrated with 
examples that TMCMC can outperform MH significantly, particularly in high dimensions. The compu- 
tational gain of using TMCMC for simulations from doubly intractable distributions, is also significant, 
and is illustrated with an example. 

The ideas developed in this paper are not confined to continuous target distributions, but also to 
discrete cases. For the sake of illustration, we consider two examples below. 

(i) Consider an Ising model, where, for % — 1, . . . , k (k > 1), the discrete random variable Xi takes the 
value +1 or —1 with positive probabilities. We then have X = {—1,1}. To implement TMCMC, 
consider the forward transformation T(x,i, e) = sgn(xi + e) with probability p^, and choose the 
backward transformation as T b (xi, e) = sgn(xi — e) with probability 1 — p^. Here sgn(a) = ±1 
accordingly as a > or a < 0, and y = (1, oo). Note the difference with the continuous cases. 
Here even though neither of the transformations is 1-to-l or onto, TMCMC works because of 
discreteness; the algorithm can easily be seen to satisfy detailed balance, irreducibility and aperi- 
odicity. However, if k — 1 with x\ being the only variable, then, if x\ = 1, it is possible to choose, 
with probability one, the backward move-type, yielding T b (xi,e) = —1. On the other hand, 
if xi = —1, with probability one, we can choose the forward move-type, yielding T(x±, e) = 1. 
Only 2 fc move-types are necessary for the /c-dimensional case for one-step irreducibility. In discrete 
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cases, however, there will be no Jacobian of transformation, thereby simplifying the acceptance 
ratio. 

(ii) For discrete state spaces like Z fc , (Z = {0, ±1, ±2, . . .}) the additive transformation with single 
epsilon does not work. For example, with k = 2, if the starting state is (1,2) then the chain will 
never reach any states (x, y) where x and y have same parity (i.e. both even or both odd) resulting a 
reducible Markov chain. Thus in this case we need to have more move-types than 2 k . For example, 
with some positive probability (say r) we may select a random coordinate and update it leaving 
other states unchanged. With the remaining probability (i.e. 1 — r) we may do the analogous 
version of the additive transformation: 

Let y = [1, oo). Then, can choose the forward transformation for each coordinate as T^Xj, e) = 
Xi + [e] and the backward transformation as Tf(x;,e) = Xi — [e], where [a] denotes the largest 
integer not exceeding a. 

This chain is clearly ergodic and we still need only one epsilon to update the states. 

However, in discrete cases, TMCMC reduces to Metropolis-Hastings with a mixture proposal. But it is 
important to note that the implementation is much efficient and computationally cheap when TMCMC- 
based methodologies developed in this paper, are used. 
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APPENDIX 



A Convergence properties of additive TMCMC 

In this section we prove some convergence properties of the TMCMC in the case of the additive transfor- 
mation. Before going into our main result we first borrow some definitions from the MCMC literature. 

Definition 1 (Irreducibility) A Markov transition kernel K is <p— irreducible, where ip is a nontrivial 
measure, if for every x G X and for every measurable set A of X with (f(A) > 0, there exists n G N, 
such that K n (x, A) > 0. 

Definition 2 (Small set) A measurable subset E of X is said to be small if there is an n G N, a constant 
c > 0, possibly depending on E and a finite measure v such that 

P n {x,A)>cu{A), VAeB{X),VxeE 

Definition 3 (Aperiodicity) A Markov kernel K is said to be periodic with period d > if the state- 
space X can be partitioned into d disjoint subsets X±, X 2 , . . . , X& with 

K(x, X i+1 ) = 1 V x e X h % = 1, 2, . . . , d - 1 

and K(x, A\) = 1 V x G X d . 

A Markov kernel K is aperidic if for no d G N it is periodic with period d. 

A.l Additive transformation with singleton e 

Consider now the case where X = R. k , V — R and T(x, e) = x + ae where a is a fc-vector with strictly 
positive entries. In this case y = [0, oo). Suppose that g is a density on y. 

Theorem 1 Suppose that ir is bounded and positive on every compact subset ofM. k and that g is positive 
on every compact subset of (0, oo). Then the chain is \-irreducible, aperiodic. Moreover every nonempty 
compact subset ofM. k is small. 

Proof 1 Without loss we may assume all the entries of a are 1 's. For notational convenience we shall 
prove the theorem for k = 2. The general case can be seen to hold with suitably defined 'rotational' 
matrices on M. k similar to (A.l). 
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Suppose E is a nonempty compact subset of IR fe . Let C be a compact rectangle whose sides are 
parallel to the diagonals {(x, y) : \y\ = \x\} and containing E such that A(C) > 0. We shall show that 
E is small, i.e., 3 c > such that 

K 2 (x,A)>c\ c (A) VA G B(R 2 ) andVx G E. 

It is clear that the points reachable from x in two steps are of the form 



x 1 ±e 1 ± e 2 
x 2 ± ei ± e 2 



ei > 0,e 2 > 



TTzws, z/we define the matrices 



Mi = | | 1 J M 2 = f 1 | | M 3 = | | M 4 



Mi = | 1 j J M 2 = J| 1 I M 3 = I | M 4 




-1 


-1 


-1 




-1 




1 





(A.1) 



then the points reachable from x in two steps, other than the points lying on the diagonals passing 
through x itself, are of the form 

x + Midl) and x + M(^), e ± > 0, e 2 > 0, i = 1, . . . , 4. 

m = inf 7r(y) > M = sup7r(y) < oo a = inf (7(e) > 
yec* ye c* o<e<i? 

where R is the length of the diagonal of the rectangle C 1 . Fix an element x££. For any se£ A e i3(IR 2 ), 

let A* = A fl C and de/wze, 

= {e G (0, cx)) 2 : x + Me G A*} 

(A.2) 

= {e G (0, oo) 2 : x + M^e G A*} 
r/ze need for defining such sets illustrated in the following example: to make a transition from the state 
~x.to a state in A* in two steps, first making a forward transition in both coordinates and then a forward 
transition in first coordinate and a backward transition in the second coordinate is same as applying the 
transformation x — > x + M±e for some e G A± in two steps, i.e. first 

x ->■ x + Mi(ei,0) T = x+ (ei,ei) T then x + M^ei, e 2 ) T ->■ x + M x e 



'Actually R/y/2 suffices. 
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Also note that for any e — (e±, e 2 ) G A it A* C C implies that the intermediate point x + Mj(ei, 0) T G C 
and similarly for A\ (i = 1, . . . , 4). Now, with p and p as the minimum and maximum of the move 
probabilities {p(I) \ I C {1,2}} 

K\x,A) > K 2 (x,A*) 

> P ^ / 9(e 1 )g(e 2 ) mm = — , 1 mm ^v , 1 de,de 2 



1=1 A 



1=1 A, 

, 2 / : 



pvr(x) 'J I p7r(x + M i (ei,0) r ) 

■fM i (ei,0) T ) 1 . fp7r(x + M i (e 1 ,e 2 ) 
— , 1 > mm < = = 



+ P_L ^1)^2) mm <^ = — , 1 nun ~ 1 ) de.de, 



= pV(mmj|^,lj) x2x^A(A 4 ) (A.3) 
Smce (ei,e 2 ) G -<=>- (62,^1) G A, 50 ?/ia?, A(A) = X(Ai). Now notice that, if we define for 

1 = 1,... A 

fi : (0, oo) 2 -)• i 2 9 £ K x + Mi€ 

and 

A x = {(e, 0) T : e > 0, (x x ± e, a; 2 ± e) G A*} 

4 44 
A* = \jMA t UA x ) =► A(A*) = ^/ i (A) = 2x^A(M 
i=i «=i i=i 

smce, f^AA's are pairwise disjoint, X(fi(A x )) = and A(/j(Aj)) = 2A(Aj) for 1 < z < 4. It follows 

from (A.3) ?/za? 



K 2 (x,A) > pV(^min||^,l|j 2 A(A*) = c\ c {A) 

where c = p 2 a 2 ( min < =— -, 1 I | > 0. 
This completes the proof that E is small. 

That the chain is irreducible, follows easily, for any x, the set {x} is a compact set and for a mea- 
surable set A with \(A) > we may choose C in the first part of the proof such that A(C D A) > 0. 
Now, 

K 2 (x,A) > c\(Cr\A)>0 
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Also aperiodicity follows trivially from the observation that any set with positive X-measure can be 
accessed in at most 2 steps. 
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